
[bgp/agg]: Add BGP aggregate address test cases for Config Persistence and Recovery #23347

Merged
shixizhang merged 4 commits into sonic-net:master from shixizhang:addbgptest-pr
Apr 1, 2026

Conversation

@shixizhang
Contributor

Description of PR

Summary:
Add new test file test_bgp_aggregate_address_resilience.py (Test Group 5) that validates BGP aggregate-address configuration persistence and recovery across various disruption scenarios. These 5 new test cases verify that aggregate address configuration written via GCU survives BGP container restarts, config reloads, cold reboots, warm reboots, and BBR state transitions.

New test cases:

  • TC 5.1 test_aggregate_persists_bgp_container_restart: Aggregate config survives BGP container restart; CONFIG_DB + STATE_DB + FRR are consistent after recovery.
  • TC 5.2 test_aggregate_persists_config_reload: Aggregate config (with summary-only=true) survives config save + config reload.
  • TC 5.3 test_aggregate_persists_config_save_and_reboot: IPv6 aggregate config survives config save + cold reboot.
  • TC 5.4 test_aggregate_bbr_required_inactive_persists_bgp_restart: BBR-required aggregate stays inactive after BGP restart when BBR is disabled; activates once BBR is enabled.
  • TC 5.5 test_aggregate_persists_warm_reboot: Aggregate config survives warm reboot.
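The tests write aggregate-address config via GCU, which takes a JSON Patch document. A minimal sketch of what such a patch might look like — the CONFIG_DB table name `BGP_AGGREGATE_ADDRESS` and the field names here are illustrative assumptions, not copied from the merged test file:

```python
import json

# Hedged sketch of a GCU (Generic Config Updater) JSON Patch for adding an
# aggregate address. Table and field names are assumptions for illustration.
def build_add_aggregate_patch(prefix, summary_only=False):
    # JSON Pointer (RFC 6901) requires "~" -> "~0" and "/" -> "~1" escaping
    key = prefix.replace("~", "~0").replace("/", "~1")
    return json.dumps([{
        "op": "add",
        "path": "/BGP_AGGREGATE_ADDRESS/{}".format(key),
        "value": {"summary-only": "true" if summary_only else "false"},
    }])

print(build_add_aggregate_patch("192.168.0.0/16", summary_only=True))
```

On the DUT, a patch like this would be applied with `config apply-patch`, which is what makes the pre-disruption CONFIG_DB write synchronous.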

Type of change

  • Bug fix
  • Testbed and Framework (new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

Existing BGP aggregate-address tests cover configuration validation and route propagation behavior, but there are no tests verifying that aggregate address configuration persists across operational disruptions such as BGP container restarts, config reloads, and device reboots. This PR fills that gap by adding resilience tests that validate CONFIG_DB, STATE_DB, and FRR running-config consistency after each disruption type.

How did you do it?

  • Created test_bgp_aggregate_address_resilience.py reusing existing helpers and fixtures from test_bgp_aggregate_address.py (AggregateCfg, gcu_add_aggregate, gcu_remove_aggregate, verify_bgp_aggregate_consistence, verify_bgp_aggregate_cleanup, dump_db, and the setup_teardown checkpoint/rollback fixture).
  • Added a bgp_neighbors fixture to discover BGP neighbor IPs for session-state polling after disruptions.
  • Pre-disruption verification only checks CONFIG_DB (GCU write is synchronous). Post-disruption verification checks the full stack (CONFIG_DB + STATE_DB + FRR) after bgpcfgd has re-processed the config.
  • Added wait_for_aggregate_state() helper to handle the asynchronous bgpcfgd STATE_DB population after disruptions.
  • All test cases include proper cleanup in finally blocks with graceful fallback to checkpoint rollback.
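The polling pattern behind `wait_for_aggregate_state()` can be sketched as follows. This is a simplified stand-in: the real helper in the PR queries STATE_DB on the DUT, whereas this sketch takes an arbitrary check callback so it can run anywhere:

```python
import time

def wait_for_aggregate_state(check_fn, timeout=60, interval=1):
    """Poll check_fn until it returns True or the timeout expires.

    Simplified sketch of the async-wait pattern described above for letting
    bgpcfgd repopulate STATE_DB after a disruption; the real helper checks
    STATE_DB entries on the DUT rather than calling a local callback.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check_fn():
            return True
        time.sleep(interval)
    return False

# Example: a stand-in STATE_DB check that succeeds on the third poll
calls = {"n": 0}
def fake_state_ready():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_for_aggregate_state(fake_state_ready, timeout=5, interval=0.01))
```

Polling with a deadline (rather than a fixed sleep) keeps the tests fast on healthy devices while still tolerating slow bgpcfgd convergence after reboots.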

How did you verify/test it?

Ran all test cases on a physical m1-48 testbed with Arista EOS neighbors.
[screenshot: test run results]

Any platform specific information?

No platform-specific dependencies. Tests use GCU for configuration and standard SONiC reboot/reload utilities, which are platform-agnostic.

Supported testbed topology if it's a new test case?

t1, m1 (declared via @pytest.mark.topology("t1", "m1"))

Documentation

Aligned with BGP-Aggregate-Address test plan

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@shixizhang shixizhang enabled auto-merge (squash) March 26, 2026 11:47

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Shixi Zhang <[email protected]>

… patch

On KVM/VS the warm-reboot script has a 1s timeout for docker-exec health
checks that is too tight, causing fpmsyncd and orchagent to crash during
warm reboot.  This is the same issue that AdvancedReboot handles by
patching the timeout to 5s (tests/common/fixtures/advanced_reboot.py).

Apply the same sed patch to the warm-reboot script before rebooting,
and restore the original safe_reboot=True call with full recovery checks.

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Shixi Zhang <[email protected]>
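The timeout patch described in this commit can be modeled locally. On the DUT the test runs an equivalent sed substitution over the warm-reboot script; the sample line below is an assumption for illustration, since the exact script content varies by SONiC version:

```python
import re

# Local model of the warm-reboot health-check timeout patch: relax the
# 1-second docker-exec timeout to 5 seconds, mirroring the sed applied on
# the DUT. The sample script line is an assumption, not the real script.
def widen_docker_exec_timeout(script_text):
    return re.sub(r"timeout 1s docker exec", "timeout 5s docker exec",
                  script_text)

sample = "timeout 1s docker exec -i bgp bash -c 'true'"
print(widen_docker_exec_timeout(sample))
```

Patching the on-device script before the reboot (rather than forking the script) keeps the test aligned with whatever warm-reboot version the image ships.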


@StormLiangMS StormLiangMS left a comment


Hi @shixizhang, nice test cases! One issue:

`pytest.mark.device_type('vs')` contradicts physical test verification — the PR description says tests were validated on a physical m1-48 testbed, but `device_type('vs')` restricts execution to VS devices when `--device_type` is specified. Please use `pytest.mark.device_type('physical')` or remove the marker entirely.

Also a minor concern: TC 5.2/TC 5.3 do `sudo config save -y` before cleanup. If the test fails after save but before teardown, the aggregate entry persists across reboots. Please ensure the `setup_teardown` fixture handles rollback of saved config.

@shixizhang shixizhang closed this Mar 30, 2026
auto-merge was automatically disabled March 30, 2026 10:15

Pull request was closed

@shixizhang shixizhang reopened this Mar 30, 2026

shixizhang added a commit to shixizhang/sonic-mgmt that referenced this pull request Mar 31, 2026
- Remove device_type('vs') marker that contradicts physical testbed validation
- Add local setup_teardown fixture with config save after rollback to prevent
  aggregate entries from persisting in on-disk config across reboots

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@linux-foundation-easycla

linux-foundation-easycla bot commented Mar 31, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.


@shixizhang
Contributor Author

Hi @shixizhang, nice test cases! One issue:

`pytest.mark.device_type('vs')` contradicts physical test verification — the PR description says tests were validated on a physical m1-48 testbed, but `device_type('vs')` restricts execution to VS devices when `--device_type` is specified. Please use `pytest.mark.device_type('physical')` or remove the marker entirely.

Also a minor concern: TC 5.2/TC 5.3 do `sudo config save -y` before cleanup. If the test fails after save but before teardown, the aggregate entry persists across reboots. Please ensure the `setup_teardown` fixture handles rollback of saved config.

Addressed the issue in a new iteration.





- Sync is_virtual_platform(duthost) fix for warm reboot KVM check
- Change topology mark from (t1, m1) to m1 only

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Shixi Zhang <[email protected]>

@shixizhang shixizhang requested a review from StormLiangMS March 31, 2026 09:35
@shixizhang shixizhang enabled auto-merge (squash) April 1, 2026 02:20
@shixizhang
Contributor Author

@StormLiangMS Thanks for the review! Both issues have been addressed:

  • device_type('vs') removed — topology is now restricted to m1 only, with no device_type marker.
  • Config save rollback — added a dedicated setup_teardown fixture that creates a checkpoint before tests and on teardown runs rollback_or_reload + config save -y, so even if a test fails after save but before cleanup, the on-disk config is restored to a clean state.

Please take another look when you get a chance.
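The checkpoint-then-rollback teardown described here can be sketched as a plain generator, so it runs without a DUT (in the test file it would be wrapped as a pytest fixture). The command strings follow SONiC GCU CLI conventions but are assumptions, not copied from the merged code:

```python
# Hedged sketch of the setup_teardown flow: checkpoint on entry; on exit,
# roll back, persist the clean state to disk, and drop the checkpoint.
def checkpoint_rollback(duthost, name="bgp_agg_resilience"):
    duthost.shell("sudo config checkpoint {}".format(name))
    try:
        yield
    finally:
        # Even if the test failed after a `config save`, this restores and
        # re-saves a clean config so nothing persists across reboots.
        duthost.shell("sudo config rollback {}".format(name))
        duthost.shell("sudo config save -y")
        duthost.shell("sudo config delete-checkpoint {}".format(name))

# Drive it against a stub host that just records the commands it ran
class StubHost:
    def __init__(self):
        self.cmds = []
    def shell(self, cmd):
        self.cmds.append(cmd)

host = StubHost()
gen = checkpoint_rollback(host)
next(gen)    # setup: checkpoint taken
gen.close()  # teardown: rollback + save + delete-checkpoint
print(host.cmds)
```

Running `config save -y` after the rollback is the key step: it is what keeps a mid-test failure from leaving a stale aggregate entry in the on-disk config.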


@StormLiangMS StormLiangMS left a comment


@shixizhang Approved

All prior review comments addressed in commits a343dab and 2babda5:

  • `device_type('vs')` removed — replaced with `is_virtual_platform(duthost)` runtime check in TC 5.5 only (correct approach)
  • Topology narrowed to `m1` — appropriate for aggregate-address tests that need real multi-ASIC or physical topology
  • `setup_teardown` fixture now does `config save -y` after rollback — prevents stale aggregate entries from persisting in on-disk config across reboots

Code quality observations:

  • Clean reuse of helpers from `test_bgp_aggregate_address.py` — no duplication
  • Good pre/post disruption verification strategy (CONFIG_DB only pre, full stack post)
  • `wait_for_aggregate_state()` helper with polling handles the async bgpcfgd race correctly
  • All 5 test cases have proper try/finally cleanup with graceful fallback to checkpoint rollback
  • TC 5.4 (BBR) properly saves and restores BBR default state
  • Warm reboot VS timeout patch mirrors the established pattern from `advanced_reboot.py`

CI: All required checks passing. The only failure is the OPTIONAL t1-lag-vpp test — unrelated.

LGTM — ready to merge.

@shixizhang shixizhang merged commit 6f61a88 into sonic-net:master Apr 1, 2026
18 of 19 checks passed